feat: add array_normalize scalar function #22013
let mut new_values: Vec<f64> = Vec::with_capacity(values.len());
let mut new_offsets: Vec<O> = Vec::with_capacity(list_array.len() + 1);
new_offsets.push(O::usize_as(0));
let mut validity: Vec<bool> = Vec::with_capacity(list_array.len());
Use NullBufferBuilder here instead. One benefit is that finishing it may output None if there are no nulls (currently we always provide a null buffer, even when there are no nulls).
Swapped to NullBufferBuilder: append_null() / append_non_null() per row, and nulls.finish() returns None when no nulls accumulated, so we stop emitting a redundant null buffer on all-valid inputs. Thanks.
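To illustrate the behavior discussed above, here is a minimal std-only sketch of a validity tracker whose `finish()` yields `None` when no null row was appended. `ValidityTracker` is a hypothetical stand-in written for this note; it borrows the method names mentioned in the reply but is not arrow's `NullBufferBuilder`, and it stores plain `bool`s rather than a packed bitmap.

```rust
/// Hypothetical toy model (not arrow's API): records per-row validity and
/// remembers whether any null was ever appended.
struct ValidityTracker {
    bits: Vec<bool>,
    saw_null: bool,
}

impl ValidityTracker {
    fn with_capacity(rows: usize) -> Self {
        Self { bits: Vec::with_capacity(rows), saw_null: false }
    }

    /// Mark the next row as valid.
    fn append_non_null(&mut self) {
        self.bits.push(true);
    }

    /// Mark the next row as null.
    fn append_null(&mut self) {
        self.bits.push(false);
        self.saw_null = true;
    }

    /// Returns `None` when every row was valid, so callers can skip
    /// attaching a validity buffer entirely.
    fn finish(self) -> Option<Vec<bool>> {
        if self.saw_null { Some(self.bits) } else { None }
    }
}
```

With two valid rows appended, `finish()` returns `None`; as soon as one `append_null()` call is made, it returns the full validity vector.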
let offsets = list_array.value_offsets();

let mut new_values: Vec<f64> = Vec::with_capacity(values.len());
let mut new_offsets: Vec<O> = Vec::with_capacity(list_array.len() + 1);
I think it might be simpler to use OffsetBufferBuilder here
Thanks @Jefffrey, swapped to OffsetBufferBuilder in c3576a30e. Each branch now uses push_length(0) for null/null-element/empty/zero-mag rows and push_length(len) for valid rows; the final buffer comes from new_offsets.finish(). Cleaner than the manual Vec<O> + OffsetBuffer::new(... .into()).
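The push_length pattern described above can be sketched with the standard library alone. `OffsetsSketch` below is a hypothetical toy model, not arrow's `OffsetBufferBuilder`; it assumes `i32` offsets and only shows how per-row lengths turn into a running-total offsets vector with a leading 0.

```rust
/// Hypothetical toy model (not arrow's API): builds list offsets from
/// per-row lengths. The vector always starts with a leading 0.
struct OffsetsSketch {
    offsets: Vec<i32>,
}

impl OffsetsSketch {
    fn new(rows: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        Self { offsets }
    }

    /// Append one row of `len` elements; a row that contributes no
    /// values (for example a null row) is recorded with `len == 0`.
    fn push_length(&mut self, len: usize) {
        let last = *self.offsets.last().expect("offsets always holds the leading 0");
        self.offsets.push(last + len as i32);
    }

    fn finish(self) -> Vec<i32> {
        self.offsets
    }
}
```

Pushing lengths 3, 0, and 2 yields the offsets `[0, 3, 3, 5]`: the zero-length row repeats the previous offset, which is what removes the need to track the running total by hand.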
557d221 to c3576a3
Thanks @crm26
Adds `array_add(array1, array2)` returning the element-wise sum of two numeric arrays. Aliased as `list_add`. Follows the per-function split pattern established by cosine_distance (apache#21542), inner_product (apache#21861), and array_normalize (apache#22013) per tracking issue apache#21536.

Semantics:
- NULL row in either input -> NULL row out
- NULL element at position i in either input -> NULL element at i out (per-element propagation, divergent from inner_product which nulls the whole row; chosen because output is a list, not a scalar)
- Length mismatch between rows -> exec_err
- Empty arrays -> empty array

Supports List, LargeList, and FixedSizeList inputs; numeric element types are coerced to Float64. If any input is LargeList, both sides are widened to LargeList for homogeneous runtime dispatch.

Uses OffsetBufferBuilder + NullBufferBuilder per the pattern adopted in array_normalize round 1.
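The row-level semantics listed for `array_add` can be sketched in plain Rust as follows. This is an illustration only: `Row` and `add_rows` are hypothetical names, a row is modeled as `Option<Vec<Option<f64>>>` instead of Arrow arrays, the error is a plain `String` instead of DataFusion's `exec_err`, and checking for a NULL row before checking lengths is an assumption.

```rust
/// Hypothetical model of one list row: `None` is a NULL row,
/// `Some(vec)` holds elements that may themselves be NULL.
type Row = Option<Vec<Option<f64>>>;

/// Element-wise sum of two rows, following the listed semantics.
fn add_rows(a: &Row, b: &Row) -> Result<Row, String> {
    // NULL row in either input -> NULL row out (assumed to take
    // precedence over the length check).
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };
    // Length mismatch between rows -> error.
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }
    // NULL element at position i in either input -> NULL element at i.
    let sums: Vec<Option<f64>> = a
        .iter()
        .zip(b.iter())
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(x + y),
            _ => None,
        })
        .collect();
    Ok(Some(sums))
}
```

Two empty rows produce an empty row, since the zipped iterator yields nothing and the lengths match.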
## Which issue does this PR close?

Partial of apache#21536: `array_scale` (the list+scalar arithmetic function in the vector math series).

## Rationale for this change

Continues the per-function split requested by @alamb on apache#21536. Three sibling PRs already merged: `cosine_distance` (apache#21542), `inner_product` (apache#21861), `array_normalize` (apache#22013). `array_add` is in flight as apache#22459 by @SubhamSinghal.

Adds element-wise scalar multiplication for numeric arrays, returning a list of the same shape. Aliased as `list_scale` to match the `array_X` / `list_X` precedent in this crate.

## What changes are included in this PR?

- New scalar UDF `array_scale(array, scalar)` in `datafusion/functions-nested/src/array_scale.rs`
- Module wire-up + registration in `datafusion/functions-nested/src/lib.rs`
- SLT tests at `datafusion/sqllogictest/test_files/array_scale.slt`
- Auto-generated function docs entry in `docs/source/user-guide/sql/scalar_functions.md`

**Signature:** first arg `List/LargeList/FixedSizeList<numeric>`, second arg numeric scalar. Both coerce to `Float64`. Same list-widening rules as the binary-op siblings.

**NULL semantics:**
- NULL row in array → NULL row out
- NULL scalar → NULL row out (whole-row, because the scalar applies uniformly)
- NULL element at position `i` → NULL element at `i` out (per-element propagation)
- Empty array → empty array

**Builders:** uses `OffsetBufferBuilder` + `NullBufferBuilder` per the pattern adopted in the round-1 review of apache#22013.

## Are these changes tested?

Yes. `array_scale.slt` covers:
- Happy paths (positive, negative, zero, fractional, single-element)
- NULL propagation at all three levels (NULL row, NULL scalar, NULL element)
- All list type variants (`List`, `LargeList`, `FixedSizeList`)
- Numeric inner type coercion (Float32, Int64, integer literals)
- Multi-row queries with both constant-scalar broadcast and per-row column scalar
- Error paths (non-numeric scalar, non-list first arg, wrong arity)
- Empty array
- `list_scale` alias

## Are there any user-facing changes?

Yes: new SQL scalar function `array_scale(array, scalar)` and its alias `list_scale`. Documented in `docs/source/user-guide/sql/scalar_functions.md`.
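As a rough illustration of the three NULL levels described for `array_scale`, the sketch below models a single row in plain Rust. `scale_row` is a hypothetical helper written for this note, not code from the PR; it works on `Option` values rather than Arrow arrays.

```rust
/// Hypothetical per-row model of the described semantics:
/// - NULL row or NULL scalar -> NULL row out
/// - NULL element at position i -> NULL element at i out
/// - empty row -> empty row
fn scale_row(row: Option<&[Option<f64>]>, scalar: Option<f64>) -> Option<Vec<Option<f64>>> {
    // `?` short-circuits to a NULL row if either input is NULL.
    let (row, k) = (row?, scalar?);
    // NULL elements stay NULL; valid elements are multiplied by the scalar.
    Some(row.iter().map(|&e| e.map(|x| x * k)).collect())
}
```

Scaling the row `[1.0, NULL, -2.0]` by `2.0` gives `[2.0, NULL, -4.0]`, while passing a NULL scalar gives a NULL row regardless of the elements.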
Adds `array_sum(array)` returning the sum of elements in a numeric array. Aliased as `list_sum`. Part of the per-function split sequence on tracking issue apache#21536, following the pattern of the already-merged PRs in this series (cosine_distance apache#21542, inner_product apache#21861, array_normalize apache#22013, array_scale apache#22466).

Semantics:
- NULL row in array -> NULL row out
- NULL elements are skipped (SQL aggregate convention; matches PostgreSQL array_sum, DuckDB list_sum, Spark aggregate). A row whose every element is NULL yields NULL.
- Empty array -> 0.0 (additive identity, matches SQL SUM over no rows conceptually, and DuckDB list_sum([]) = 0)

Input is List/LargeList/FixedSizeList of any numeric type; elements are coerced to Float64. Output is Float64.
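The `array_sum` rules above distinguish an empty row (which yields 0.0) from a row whose elements are all NULL (which yields NULL). A minimal plain-Rust sketch of that distinction follows; `sum_row` is a hypothetical helper for illustration and does not reflect how the PR iterates Arrow buffers.

```rust
/// Hypothetical per-row model of the described semantics:
/// - NULL row -> NULL
/// - empty row -> 0.0
/// - NULL elements are skipped; an all-NULL row -> NULL
fn sum_row(row: Option<&[Option<f64>]>) -> Option<f64> {
    let row = row?;
    if row.is_empty() {
        return Some(0.0);
    }
    let mut saw_value = false;
    let mut total = 0.0;
    // `flatten` skips the NULL elements and yields only the valid ones.
    for x in row.iter().flatten() {
        total += x;
        saw_value = true;
    }
    if saw_value { Some(total) } else { None }
}
```

The `saw_value` flag is what separates the two cases: both an empty row and an all-NULL row accumulate a total of 0.0, but only the empty row reports it.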
Which issue does this PR close?
Part of #21536 — split of #21371 into one-function-per-PR. Third in the series after #21542 (cosine_distance) and #21861 (inner_product).
Rationale for this change
Adds `array_normalize(array)`: the L2-normalized version of a numeric input vector. Computed as `array[i] / sqrt(sum(array[i]^2))` per element. Returns the same shape as the input (`List<Float64>` or `LargeList<Float64>`).

Aliased as `list_normalize` to match the `array_X` / `list_X` convention used across the crate.

What changes are included in this PR?

Coercion shell mirrors the merged cosine_distance/inner_product pattern:

- `coerce_types` accepts `List` / `LargeList` / `FixedSizeList` of any numeric inner type, plus bare `NULL`. After coercion the inner function only sees `List(Float64)` or `LargeList(Float64)`.
- `as_float64_array(list_array.values())` downcast plus `value_offsets()` slicing, with no per-row downcasts.
- `Vec<f64>` for values, `Vec<O>` for offsets, `NullBuffer` for row validity.

Per-row semantics:

- sqrt(sum-of-squares)

Are these changes tested?

Yes. SLT covers:

- `NULL` input, NULL element in list, zero vector, empty array
- `LargeList`, `FixedSizeList` (via coercion), `Float32` and `Int64` inner types, integer literals
- List(Float64))
- `list_normalize` alias coverage (constant + multi-row with NULL)

Are there any user-facing changes?

New scalar function `array_normalize` (alias `list_normalize`), documented in `docs/source/user-guide/sql/scalar_functions.md`.
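For reference, the formula stated in the description, `array[i] / sqrt(sum(array[i]^2))`, can be sketched on a plain slice as below. `l2_normalize` is a hypothetical helper, not the PR's code. Returning `None` for a zero-magnitude input (which includes the empty slice) is an assumption made here to avoid dividing by zero; the description's own per-row rules for those cases are not recoverable from this page.

```rust
/// Hypothetical sketch of L2 normalization on one row of valid values.
/// Returns `None` when the magnitude is zero (assumption: the actual
/// zero-vector and empty-array behavior of the function may differ).
fn l2_normalize(v: &[f64]) -> Option<Vec<f64>> {
    // Magnitude is the square root of the sum of squares.
    let magnitude = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    if magnitude == 0.0 {
        return None;
    }
    // Divide every element by the magnitude; the output keeps the input length.
    Some(v.iter().map(|x| x / magnitude).collect())
}
```

For the input `[3.0, 4.0]` the magnitude is 5.0, so the result is approximately `[0.6, 0.8]`, a vector of unit length.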